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Abstract. In many problems in Computational Physics and Chemistry, one finds 
a special kind of sparse matrices, called banded matrices. These matrices, which are 
defined as having non-zero entries only within a given distance from the main diagonal, 
need often to be inverted in order to solve the associated linear system of equations. 
In this work, we introduce a new 0(n) algorithm for solving such a system, with 
the size of the matrix being n x n. We derive analytical recursive expressions that 
allow us to directly obtain the solution. In addition, we describe the extension to deal 
with matrices that are banded plus a small number of non-zero entries outside the 
band, and we use the same ideas to produce a method for obtaining the full inverse 
matrix. Finally, we show that our new algorithm is competitive, both in accuracy and 
in numerical efficiency, when compared with a standard method based on Gaussian 
elimination. We do this using sets of large random banded matrices, as well as the 
ones that appear in the calculation of Lagrange multipliers in proteins. 
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1. Introduction 

In this article we present the efficient formulae and subsequent algorithms to solve the 
system of linear equations 



where A is a n x n matrix, x is the fix 1 vector of unknowns, b is a given n x 1 vector 
and A satisfies the equations below for known values of m u , mi < n 



i. e., A is a banded matrix and (1) is a banded system. We also investigate how to solve 
similar systems where there are some non-zero entries not lying in the diagonal band. 

Banded systems like this are abundant in computational physics and computational 
chemistry literature, especially because the discretization of differential equations, 
transforming them into finite-difference equations, often results in banded matrices 
[1,2]. Many examples of this can be found in boundary value problems [3-5], in fluid 
mechanics [6-8], thermodynamics [9], classical wave mechanics [3], structure mechanics 
[10], nanoelectronics [11], circuit analysis [12], diffusion equations and Maxwell's first- 
order curl equations [2]. In quantum chemistry, finite difference methods using banded 
matrices are used both in wavefunction formalism [13-15] and in density functional 
theory [16, 17]. In addition to finite-difference problems, banded systems are present in 
several areas, such as constrained molecular simulation [18-20], including the calculation 
of Lagrange multipliers in classical mechanics [21]. An important case for the calculation 
of Lagrange multipliers deals with molecules with angular constraints. The banded 
method presented here is suitable to calculate the associated Lagrange multipliers 
exactly and efficiently [22]. Banded matrix techniques are useful not only in linear 
systems, but also in linearized ones, which also appear frequently in the literature [4, 6- 



The solution of a linear system with A being a n x n dense matrix requires 
0(n 3 ) floating point operations^ (a floating point operation is an arithmetic operation, 
like addition, subtraction, multiplication and ratio, involving real numbers which are 
represented in floating point notation, the customary nomenclature in computers). 
However, banded systems can be solved in 0{nm u mi) floating point operations using 
very simple recursive formulae, and the explicit form of A^ 1 can be obtained in 0{n 2 ) 
floating point operations. As mentioned earlier, there exist a number of physical 
problems whose behaviour is described by banded systems where m u , mi <C n. This 
makes it possible to get large computational savings if suitable algorithms are used, 
what is even more important for computationally heavy problems like those in which the 
calculation of relevant quantities requires many iterations (Molecular Dynamics [14, 24] , 
Monte Carlo simulations [25], quantum properties calculations via self-consistent field 
equations [26,27], etc.). 

\ As stated in [23], this can be reduced to 0{n lo ^ 7 ~ 2 - m7 ). 



Ax = b 



(1) 



A u+K = V K > m u , V/ 
A I+LJ =0 V L > m t , V/ , 



(2) 
(3) 



9,18]. 
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In this work, we introduce a new algorithm for solving banded systems and inverting 
banded matrices that presents very competitive numerical properties, in many cases 
outperforming other commonly used techniques. Additionally, we provide the explicit 
recursive expressions on which the algorithm is based, thus facilitating further analytical 
developments. The linear (0(n)) scaling of the presented algorithm is a remarkable 
feature since efficiency is commonly essential in today's computer simulation of physical 
systems, specially in fields such as molecular mechanics [28-30] (using efficient force 
fields) and quantum ab initio methods [31-33]. 

The article is structured as follows: in section 2, we derive simple recursive formulae 
for the efficient solution of a linear banded system. These formulae enable the solution 
of (1) in 0(nm u mi) floating point operations and are suitable to be used in a serial 
machine. In section 3, we extend these formulae to systems where some entries outside 
the band are also non-zero. In section 6, we briefly discuss the differences between 
our new algorithm and one standard method to solve banded systems. In sec. 7 we 
quantitatively compare the performance of both algorithms in terms of accuracy and 
numerical cost. For this comparison, in sec. 7.1 we use randomly generated banded 
systems for inputs. In section 7.2 we apply the algorithms to the problem of calculating 
the Lagrange multipliers which arise when imposing holonomic constraints on proteins. 
Finally, in section 8, we state the most important conclusions of the work. In the 
Appendix we provide equations for the explicit expression of the entries of A -1 . Some 
remarks on the algorithmic implementation of the methods presented here, their source 
code and remarks on their parallelization can be found in the supplementary material. 

2. Analytical solution of banded systems 

One of the most common ways of solving the linear system in equation (1) is by gradually 
changing the different entries of the matrix A to zero through the procedure of Gaussian 
elimination [34-36]. This procedure is based on the possibility of writing A as A = LU, 
where L is a lower triangular matrix and U is an upper triangular matrix. This way 
of writing A, called LU- decomposition, is possible (i.e., L and U exist), if, and only 
if A is invertible and all its leading principal minors are non-zero [37]. If one of the 
two matrices L or U is chosen to be unit triangular, i.e., with l's on its diagonal, the 
matrices not only exist but are also unique. 

The analytical calculations and algorithms introduced in this work are based on a 
different but closely related property of A, namely, the possibility of finding Q (a lower 
triangular matrix) and P (an upper triangular one), so that we have 

QAP = I A' 1 = PQ , (4) 

where I is the identity matrix. 

The requirements in order for these two matrices to exist are the same as those 
in the /.[/-decomposition, because in fact, the two propositions are equivalent. The 
existence of a 'QP- decomposition' arises from the existence of the LU decomposition. 
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This trivially proved if we make Q = L~ x and P = U~ x . The converse implication 
follows from the following facts. A must be invertible so that equation (1) has a unique 
solution. The fact that its determinant (det A) is different from zero and the relation 
det Q det A det P = det 1=1 force both Q and P to have non-zero determinants and 
therefore to be invertible. This enables one to write A = Q~ 1 P~ 1 , and since the inverse 
of a triangular matrix is a triangular matrix of the same kind, we can identify L = Q -1 
and U = P~ l thus proving the existence of the /^[/-decomposition. This equivalence also 
enables one to say that, as long as one of the two matrices Q and P is unit triangular, 
the Q-P-decomposition is unique. 

An important qualification of this situation is that in order to solve the system in 
equation (1), using QP (or LU) decomposition is not the only option. We can also 
solve the system by performing a Gaussian elimination process that is based on the 
QP (or LU) decomposition of a matrix A, which is obtained from A by permuting its 
rows and/or columns. If these permutations are performed (what is called pivoting), 
the condition for QAP = I (or A = LU) to hold is simply that A is invertible. 
Typically, the algorithms obtained from the pivoting case are more stable. For the sake 
of simplicity, derivations of this paper deal only with the non-pivoting case (the reader 
should notice that pivoting can be included in the debate with minor adjustments). In 
the supplementary material, algorithms including and lacking pivoting can be analysed. 

Let us now build the matrices P and Q that satisfy (4) for a given matrix A. When 
we have obtained them, they can be used to compute the inverse A -1 , and then we will 
be able to solve (1). However, in this section (see also ref. [38]) we will see that there is 
no need to explicitly build A -1 , and the information needed to calculate P and Q can 
be used in a different way to solve (1). 

We begin by writing P and Q as follows 



n 



P := 



P1P2 



...P n = Y[Pi 



K , 



(5a) 



K=l 



1 



Q ■ QnQn—l 



...Qi = 11 Q. 



K , 



(5b) 



K=n 



being 



/ 1 



\ 



1 



^KK Z,K,K+1 
1 




(6) 



1 



V 
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and 



/ 1 



Q 



K 



^K+mi,K 



(7) 



V i / 

where K — 1, . . . , n, and all the non-specified entries are zero. Note that Pk equals the 
identity matrix except in its K-th row, and Qk equals the identity matrix except in its 
K-th column. 

Now, the trick is to choose all coefficients C,u in the preceding matrices so that we 
have QAP = I in (4) (whenever the conditions for this to be possible are satisfied; see 
the beginning of this section). 

First we must notice that given (6), multiplying a generic matrix G (by its right) 
by P K is equivalent to adding the K-th column of G multiplied by the corresponding 
£ coefficients to several columns of G, while at the same time multiplying the K-th 
column of the original matrix by £kk' 



(GPk)ij = Gu 
(GPk)ik = Gik^kk , 
(GPk)ij = Gu + Gik^KJ 



for J < K and J > K + m v 



for K < J < K + m u 



(8a) 
(86) 
(8c) 



If we take this into account, we can choose £n so that (APi) u = 1, and (APi)u = for 
J — 2, ... ,n. Given the fact that A is banded (see, in particular (2, 3)), we have that 



(AP 1 ) 11 
(AP^j 



Antn = 1 
Au + An^j 







l/An , 
A u 



1 < J < 1 + m v 



Lll 



(9 a) 
(96) 



Operating in this way, we have 'erased' (i.e., turned into zeros) the superdiagonal entries§ 
of A that lie on its first row, and we have done this by multiplying A on the right by Pi 
with the appropriate £u. Then, if we multiply (APi) on the right by P 2 and choose the 
coefficients £ 2 j in the analogous way, we can erase all the superdiagonal entries in the 
second row and change its diagonal entry to 1. In general, multiplying (APi ■ ■ ■ Pk-i) 
by Pk erases the superdiagonal entries of the K-th row of (AP± ■ ■ ■ Pk-i), and turns its 
diagonal KK entry to 1. This procedure is called Gaussian elimination [37], and after 
n steps, the resulting matrix is a lower unit triangular matrix A YYk=i Pk = AP. 

The expression for the coefficients with I < J and / > 1 is more complex than 
(9a) because, as a consequence of (8a, 8b, 8c), whenever we multiply a matrix on the 



§ We call superdiagonal entries of a matrix A to the entries Au with I < J, and subdiagonal entries 
to the entries Ajj with I > J. 
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right by Pk, not only is its K-th row (the one we are erasing) affected, but also all the 
rows below are affected (the mi rows below in the case of a banded matrix like (4)). 
However, the matrix A Y\l=i Pl is in all its superdiagonal entries belonging to the 
first K — 1 rows, and multiplying it on the right by Pk has no influence on these rows. 
Hence, the fact that we have chosen to erase the superdiagonal entries of A from the 
first row to the last row allows us to express the general conditions that the £ coefficients 
belonging to different -fV's must satisfy the following: 

J[Pk) = 1 , (10a) 
f[ P K J =0 for / < J . (10b) 

K=l J I J 

Now, using (10a) together with (8b), we can derive the following expression for the 
coefficient £// in terms of the previous steps of the process: 

a n p « ) = ( a ri p « p i ) = ( a ri p * ) & = 1 

K=l J 11 V K=l ) 11 \ K=l J n 

=► tn = ? j\ r-. (11) 

Analogously, using (106) and (8c), we can write an explicit expression for £/j with 
I < J: 

a\[p k pA =[a\[Pk\ +(^n p ^) ^ = 

K=l Ju \ K=l J U V K=l ) u 



HJ 




(12) 



( a U i k -1 1 p k ) ii 

Also according to (8a, 8b, 8c), for I < J 

l / M-l \ 

^n^) =Au+ Yl ( a II Pk ) I>L - ( 13 ) 

j M=J-m u \ J£T=1 / jm 

Note that, in this equation we have / > M\\, which entails that ^n^=i 
are subdiagonal entries. This enables one to calculate the coefficients C,u with I > J, 
i.e., those that correspond to the matrices Qk, once all the coefficients in the matrices 
Pk have already been evaluated. We know that AP is a unit lower triangular matrix. 
This means that its subdiagonal /, M entry (with / > M) equals because no other 
changes affect this entry when multiplying AP by the different Q^s. If G is a generic 
matrix, then (7) implies 

(QkG)u = Gu for / < K and / > K + m { , (14a) 

(QkG) I j = G ij + GkAik ioi K<I<K + m h for all J . (146) 
|| Because I > L by hypothesis, and L > M. 
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If T is any unit lower triangular matrix (this is, its diagonal entries equal 1 and its 
superdiagonal entries are zero, T u = 1, Tjj = for I < J), then (7) implies 

(QkT)u = Tjj for / < K and / > K + m x , (15a) 

(QkT)u = Tjj + T KJ i lK iorK<I<K + m h J<K , (156) 
{QkT) I j = T ij for K <I <K + m h J> K . (15c) 

Moreover, if Sk-i is a unit lower triangular matrix satisfying (Sk-i)ij = for / > J, 
J < K, then equations (15a, 156, 15c) become 

(Qk(S K -i))ij = (S K -i)u for / < and / > K + m, , (16a) 
(Qk(S K -i))ij = (S K -i)ik + (S K -i)kk£ik for K < I < K + mi , (16b) 
(Qk(S K -i))ij = (S K -i)u for K < I < K + mi, J > K . (16c) 

Multiplying Qi on the left by (AP) erases the subdiagonal entries of the first 
column of (AP), and multiplying (Q\AP) on the left by Q2 erases the subdiagonal 
entries of the second column of (Q\AP). By repeating this procedure, multiplying 
Qk by (Qk-i ■ ■ ■ QiAP) by the left, the subdiagonal entries of the K column of 
(Qk-i ■ ■ ■ QiAP) are erased. Therefore, (Qk-i • • • QiAP) satisfies the conditions of 
S K -i, and hence it satisfies equations (16a, 166, 16c). The conditions for the correct 
erasing are: 

(Qk(Qk-i---QiAP)) ik = (17) 
(Qk-i ■ ■ ■ QiAP) IK + (Qk-i • • • QiAP) K kCik = for K < I < K + m t . 
The expressions in (17) can be simplified. Equation (16c) implies 

(Qk-i ■ ■ ■ QiAP) IK = (Q K -i(Qk-2 • • • QiAP)) IK = 
(Qk'(Qk'-i ■ ■ ■ QiAP))i t K'+i = (Qk'-i ■ ■ ■ QiAP)i t K'+i = 
(Qk-2---QiAP) ik , (18a) 

where we have defined K' :— K — 1. By repeating operations like this, it is easy to 
obtain 

(Qk-i---QiAP) ik = (AP) IK hiK<I<K + mi , (19a) 
(Qk-i ■ ■ ■ QiAP) KK = (AP)kk = 1 • (196) 
In addition, we must consider that (8a, 86, 8c) imply 

M-l \ / n \ 

^11^ = ^11^ /Zmm = (AP) im /Zmm for/>M.(20) 

L=l J IM V L=l J im 

Using (19a, 196, 20) into (17) we obtain 

' M-l \ 

A ri Pi ) =-iiMliuM for/>M. (21) 

L=l J IM 

If we apply this on the right hand side of (13), then insert the resulting expression 
with J = I and L = I — 1 into (11) and also insert it with L — I — 1 into (12), we get 
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Oil = (An - ]T ^ MI ) , (22a) 



Zu = Zii(-A I j+ V P^Zmj) for I < J . (22b) 

\ M=I-m u ^ MM 

Following an analogue procedure, we obtain 

7 1 e 



0j = 6/ E I^^mj for 7 > J. (23) 



For efficiency reasons, we prefer to define xu '■= iu/iii for I > J ■ This makes (22a, 
22 6, 23) become 

An- XimZmiJ , (24a) 

M=I-m u / 

iu = in i-Au + Yl XiuiuA for / < J , (246) 

\ M=I-m u / 

J-l 

X/j = -A I3 + E X/mCmj for I > J ■ (24c) 

M=J-m l 

The three equations above can be further modified with the aim of improving the 
numerical efficiency of the algorithms derived from them. The starting point for the 
summations in (24a) must be the value of M such that both £im and £mj are non-zero. 
We must take into account that in a banded matrix the number of non zero entries 
above and on the left of the J, J entry depends on the values of I, J: 

• There are m u + (I — J) non-zero entries immediately above Ajj. 

• There are mi — (I — J) non- zero entries immediately on the left of Ajj. 

These properties are also satisfied in A Y\l=i ?l for all K. Therefore, if we define 
Hn := min{m u + (I — J), mi — (I — J)} , 
// := min{m u ,m/} , 

we can re-express (24a, 246, 24c) as 

( - 1 V 

in = I A H - ^ Xim£mi , (25 a) 

\ M=max{l,7-/i'} J 

\ 

A! j + Yl XimZmj for / < J , (256) 




M=max{l,J— nu} 
J-l 

Xu = -An + Y Xiuiuj for I > J ■ (25c) 

M=max{l, J— Hij} 
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In the restricted but very common case in which mi = m u =: m, the previous 
equations become 

-l 



1-1 



in = \A n - ^2 Xim^mi , (26 a) 

y M=max(l,I-m) J 

tu = in I -At j + Xiuiuj j for / < J , (266) 

y M=max{l,J-m} J 

J-1 

Xu = -Au + ^2 XimZmj for I > J • (26c) 

M=max{l,/— m} 

If the matrix A is symmetric (Au = Aji), we can avoid performing many operations 
simply by using 

Xu = 0i/0j, for I>J, (27) 

instead of (25c). Equation (27) can easily be obtained from (25a) by induction. 

The reader must also note that, although the coefficients C,u have been obtained by 
performing the products []| (=n QrA YYl=i i n a certain order, they are independent 
of this choice. Indeed, if we take a look to expressions (5a), (5b), (6), and (7), 
we can see that the K-th row (or column) is always erased before the (K + 1)- 
th one. It does not matter if we apply first Qk or P K to erase the K row (or 
column); the result of the operation will be the same. In both cases, —GikGkj/Gkk 
(where G := nL=R--i QmA YIl=i ^l) is added to all the entries of G u such that 
I e {K+l, . . .,K+mi} and J G {K+l, . . .,K+m u }. This is valid when both the K-th 
row and the i^-th column are not erased yet. If one of them is already erased, erasing the 
other has no influence on Gu with I e {K+l, . . . , K+mi} and J e {K+l, . . . , K+m u }. 
In both cases £ik = —Gjk/Gkk for I > K, and £ KJ = —Gkj/Gkk for J > K. This 
is because all the previous rows (or columns) have been nullified before, and adding 
columns (or rows) has no influence on the i^-th one. 

Now, the algorithm to solve (1) can be divided into three stages (in our 
implementation we join together the first and second ones). Since A~ l = PQ (A), 
these steps are: 

(i) To obtain the coefficients £. 

(ii) To obtain the intermediate vector c := Qb. 

(iii) To obtain the final vector x = Pc. 

Now, using the results derived above, let us calculate the expressions for the second 
and third steps: 

Whenever we multiply a generic n x 1 vector v on the left by Qk (see (7)), we 
modify its K-th. to (K + m^)-th rows in the following way: 

(Q A -u) J = u / iorI<K, I>K + m x , (28a) 

(Qkv)i = vi + £ ik v k for K < I < K + m, . (286) 
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Since Q := Q n Qn-i ■ ■ ■ Qi, using the expression for each of the Qk in (7), and the fact 
that £u = for I > J + m h we have 

Q u = for I < J , (29a) 

Qn = 1 , (296) 

Q 7 j = ^imQmj for 7 > J . (29c) 

M=max{7— m;,l} 

where the maximum in the lower limit of the sum accounts for boundary effects and 
ensures that M is never smaller than 1. 

From these relations between the entries of Q, we can get the components cj of the 
intermediate vector c in the second step above: 

n I 7-1 ( 7-1 

J=l J=l J=l \M=ma,x{I- mi ,l} 

I-l 7-1 1-1 

= 6/ + ^2 & m ^2 q Mjkj = hi + ^2 ^mcm 

M=max{7— m;,l} J=l M=max{7— m ; ,1} 

7-1 

= 67 + ^ XimCmmCm ■ (30) 

M=max{7— m; ,1} 

In the first row of the equation above, we applied (29a), and then (29c). In the second 
row of the equation above, we performed a feedback in the equation. 

We will now turn to the third and final step of the process, which consists of 
calculating the final vector x = Pc. Whenever we multiply a generic nx n matrix G on 
the left by Pk (see equation (6)), the resulting matrix is the same as G in all its rows 
except for the K-th one, which is equal to a linear combination of the first m u + 1 rows 
below it: 

(P K G) U =Gjj iorl^K , (31a) 

mm{K +m u ,n} 

(PkG) K j= J2 ZklG L j, (31b) 

L=K 

where the minimum in the upper limit of the sum accounts for boundary effects and 
ensures that K + L is never larger than n. 

If we now use the relations above to construct P as in (5a), i.e., we first take P n 
and multiply it on the left by P n -i, then we multiply the result, P n _iP n , on the left by 
P n -2, etc., we arrive to: 

Pn = 67 , (32a) 

min{7+m I1 ,n} 

Pu= J2 & kPkj for/<J, (326) 

K=I+l 

Pu = for I > J , (32c) 
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meaning that every row of P is a linear combination of the following rows, plus a term 
in the diagonal. 

These expressions allow us to obtain the last equation that is needed to solve the 
linear system in (1): 

ran ra / mm{I+m u ,n} \ 

xi = ^ p u c J = ^2 p u c J = £n c i + ^ ^2 &kPkj Cj 

J=l J=I J=I+1 y K=I+1 J 

min{/4-m„,n} n 

= inci + ^2 & k ^2 p kj°j 

K=I+1 J=I+1 
min{I+m u ,n} 

= inci + ^2 iiKX K ■ (33) 

K=I+1 

Now, we can use expressions (25a), (256), and (25 c) in order to obtain the 
coefficients £, and then plug them into (30) and (33) in order to finally solve (1). 

To conclude, let us focus on the computational cost of this procedure. From (25a), 
(256), and (25c), it follows that obtaining the coefficients £ requires 0(n) floating point 
operations. Being more precise, the summations in (25a), (256), and (25c), require nu 
products and fijj — 1 additions (2/i/j — 1 floating point operations)^. If, without loss 
of generality, we consider m u > mi, it is easy to check that the following computational 
costs hold: 

• Obtaining one diagonal £/•/ takes 2// + 2 ~ 2m u floating point operations. 

• Obtaining one superdiagonal coefficient C,u (where I < J) takes about 
2min{m;,m u — (J — I)} floating point operations. Hence, in order to obtain all 
the coefficients in a column above the diagonal, there are two sets of £'s that 
require a different number of operations. The lower one requires 2m/ floating point 
operations and the upper one requires 2{m u — (J — I)) floating point operations. 
All in all, obtaining £j-/,J for / = 1, . . . , m u takes about mi(2m u — mi) floating 
point operations. 

• In order to obtain the £/j coefficient in a subdiagonal row (I > J), the number of 
floating point operations to be performed is min{m u , mi — (I — J)} = mi — (I — J); 
this is performed in such a way that the total number of floating point operations 
related to this row is approximately mf. 

Finally, obtaining all coefficients £ would require slightly less operations (due to 
the boundary effects) than 2nm u mi floating point operations. Once they are known (or 
partly known during the procedure to get them), we can obtain the solution vector x 
using the simple recursive relationships presented in this section at a cost of 4n(m u + mi) 
floating point operations. 

The meaning of m u ,m; can be noticed in (2, 3). 
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3. Banded plus sparse systems 

A slight modification of the calculations presented in the previous section is required 
to tackle systems where not all the non-zero entries are within the band. The resulting 
modified procedure is described in this section. 
If we have 

T 

^ max 

A'^A+J2^r t sA RtSt , (34) 

T=l 

with A banded (see eqs. (2) and (3)) and the matrix A RtSt consisting of entries 
(A RtSt ) ij = Sj t R T Sj t s T , o~u being the Kroenecker delta, we shall say that A' is a banded 
plus sparse matrix, and 

A'x = b (35) 

a banded plus sparse system. We call an extra-band entry any nonzero entry which does 
not lie in the band (this is, A\ 3 is an extra-band entry if it is not zero and I < J, 
J > I + m u or I>J,J>J + mi). 

In the pure banded system (section 2) only £k,k+j and £k+i,k with K — 1, . . . , n; 
J = 1, . . . , m u ; I — 1, . . . , mi had to be calculated. In this case, we also need to obtain 

£/s T if R T < S T , for / = R T , R T + 1, . . . , S T — rn u — 1 , 
£ Rt j if R T > S T , for J = S T , S t + 1, . . . , R T - m t - 1 , 

with T = 1, . . . ,T max . 

As seen in the previous section, in order to erase (i.e., turn to 0) entry Gu, with 
/ < J, of a generic matrix G, we can multiply it by a matrix Pj (see (6, 8a, 86, 8c)). 
This action adds the column / (times given numbers) of matrix G to other columns of 
G. This erases Gjj, but (in general) adds nonzero numbers to the entries below it [KJ 
entries with K > I). Therefore, if these entries were zero before performing the product 
GPj, they will in general be nonzero after it. This implies that they will also have to be 
erased. Hence, erasing the extra-band entry 1.1 of A 1 will not suffice; the entries I + 1, J, 
I + 2, J, . . ., J — m u + 1, J will also have to be erased. If the extra-band entry to erase 
A'jj is below the diagonal (I > J), then the Qj matrices (7, 14a, 14b) can be used to 
this end, since they add rows when multiplied by a generic matrix. Erasing A' u will 
probably make that entries A' IK with K = J + 1, . . . , I — mi — 1 become nonzero, and 
these entries will have to be also erased. 

In order to erase the extra-band entries, the expressions presented in the previous 
section can be used. All extra-band entries can lie in an extended band wider than the 
original band. But, for the sake of efficiency, the entries in the extended band which are 
zero during the erasing procedure must not enter the sums for the coefficients £, x- 

We define 



v RtI := max{R T , I - m t } , 
ps T j := max{S' T , J + m u } . 



(36 a) 
(366) 
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If Rt < St (superdiagonal extra-band entry), in addition to coefficients appearing in 
(25a, 256, 25 c) we have to calculate 

£r t S t = -£r t Rt A 'r t St ' ( 37 °) 



J-l 



UJ 



£11 ^2 Xim£mj \ for R T < I < S T - m u ; (376) 



yM=U 



and if Rt > St (subdiagonal extra-band entry), in addition to coefficients appearing in 
(25a, 256, 25 c) we have to calculate 

^ t5t = ~A' RtSt , (38a) 
j-i 

Xu = ^2 &mXmj foi S T < J < I - mi . (386) 

M=p 

The coefficients appearing in (37a, 376, 38a, 386) arise from merely applying equations 
(25a, 256, 25c) and avoiding to include in them the coefficients £, x which are zero due 
to the structure of A'. 

Equations (37a, 376, 38a, 386) have to be modified for / < J if there exist A' RxSt 
with R x < Rt- This is because erasing the upper non-zero entries by adding columns 
creates new non-zero entries below them, and the new relations must take this into 
account. Analogous corrections must be done for / > J if there exist A' RtSx with 
S x > St- The general rule to proceed in sparse plus banded systems is to apply equations 
(25a, 256, 25c) using the maximum m' u , m\ so that all the nonzero entries of A' lie within 
the enhanched band (given by m' u , mj), and avoid that the coefficients (£, x) which are 
zero take part in the sums. The coefficients £xl which are zero are those given by the 
following rules: 

• If K < L, i KL = if A' ML = for M = 1, . . . , K 

• If K > L, i KL = if A' KM = for M = 1, . . . , L 

The computational cost of solving banded plus sparse systems scales with n, as long 
as the number of columns above the band and rows below it containing non-zero entries 
A' RtSt is small (•< n). The example code for an algorithm for sparse plus banded 
systems can be found in the supplementary material; the performance of this algorithm 
is presented in sec. 7.2. 



4. Algorithmic implementation 

Based on the expressions (25a, 256, 25c, 30, 33) derived in the paper, we have coded 
several different algorithms that efficiently solve the linear system in (1). The difference 
between the method in this paper and the most commonly used implementation of 
Gaussian elimination techniques, such as the ones included in LAPACK [34], Numerical 
Recipes in C [35], or those discussed in ref. [36] is that these methods perform an LU 
factorization of the matrix A, and the coefficients £ for the Gaussian elimination are 
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obtained in several steps, whereas the method introduced here does not perform such 
an LU factorization, and it obtains the coefficients £ in a single step. 

In order to obtain the solution of (1) we need to get the coefficients £ for Gaussian 
elimination as explained in section 4. That is, one diagonal coefficient for each 
row/column, plus m u coefficients in each row and mi coefficients in each column (except 
for the last ones, where less coefficients have to be calculated). More accuracy in the 
solution is obtained by pivoting, i.e., altering the order of the rows and columns in 
the process of Gaussian elimination so that the pivot (the element temporarily in the 
diagonal and by which we are going to divide) is never too close to zero. Double pivoting 
(in rows and columns) usually gives more accurate results than partial pivoting (in rows 
or columns). However, the former is seldom preferred for banded systems, since it 
requires 0(n 2 ) operations, while the latter requires only 0(n). In the implementations 
described in this section, we have chosen to perform partial pivoting on rows, as in 
refs. [34,35]. In the same spirit, and in order to save as much memory as possible, we 
store matrices by diagonals (see [35]). 

We proceed as follows: For each given /, we obtain £ n (using (25a)), and then 
£j/ (using (25c)) for J = I + 1, . . . , I + m^ If > \£n\, we exchange rows / and 
J in the matrix A and in the vector b. This is called partial pivoting in rows, and it 
usually gives greater numerical stability to the solutions; in our tests of section 6 the 
error was lowered in two orders of magnitude by partial pivoting. Next, we calculate 
(using {25b)) for J — I + 1, ... ,1 + m u . When we have calculated all the relevant 
coefficients £ for a given /, we calculate c/ using (30). We repeat these steps for all 
rows /, starting by / = 1 and moving one row at a time up to / = n. This ordering 
enables us to solve the system using eqs. (25a), (30), (33), because the superdiagonal £/j 
(i.e., those with I < J) only require the knowledge of the coefficients with a lower row 
index /, while the subdiagonal coefficients £/j with I > J only require the knowledge of 
coefficients with a lower column index. We have additionally implemented a procedure 
to avoid performing dummy summations (i.e., those where the term to add is null), 
which eliminates the need for evaluating nu in every step. According to the pivotings 
performed before starting to calculate a given £, a different number of terms will appear 
in the summation to obtain it. This procedure uses the previous pivoting (i. e., row 
exchanging) information and determines how many £ coefficients have to be obtained 
in any row or column, and how many terms the summation to obtain them will consist 
of (this procedure is not indicated in the pseudo-code below for the sake of simplicity). 
The final step consists of obtaining x from b using (33). 

The pseudo-code of the algorithm can be summarized as follows: 

/ / Steps 1 and 2: Calculating the coefficients £ and the vector c 
for (K = 1, K < n, K + +) do 
/ / Calculating the diagonal £'s: 

1 1 Calculating the subdiagonal £'s: 
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for (I = K + 1, I <I + m t , I + do 

ilK = ~Mk + Z~2m=K-ii ik ilM^MK for I > J 

end for 

/ / Pivoting: 

if 3 > |f // 1 for J = J + 1, . . . , I + mi then 
for (K = 1, K <n, K + +) do 

A/a- <->■ A JX 
end for 

6/ 6j 
end if 

/ / Calculating the superdiagonal f 's: 

for ( J = if + 1, J < K + m u , J + +) do 

C^J = -£kkA K j + J2m=k-h kj £km£mj 
end for 

// Calculating = (Qo)k- 
c K = b K 

for (L = K — mi, L < K — 1, L + +) do 

= £k,lCl 
end for 
end for 

/ / Step 3: Calculating xk = {Pc)k = (PQb)K- 
for (K = n, K>1, K - -) do 

for (L = K + 1, L < K + m u , L + +) do 

end for 
end for 

In the actual computer implementation we split the most external loop into three 
loops (7 = 1,..., 2m, I = 2m + 1, . . . , n — 2m, and I = n — 2m + 1, . . . , n), because the 
summations to obtain the coefficients £ lack some terms in the initial and final rows. 
We store A by diagonals inanx {2m u + mi + 1) matrix in order to save memory and, 
with the same objective, we overwrite the original entries Ajj with the calculated £/j 
for I < J, and we store the £/j with / > J in another n x (2m;) matrix. 

One possible modification to the algorithm presented above is to omit the pivoting. 
This usually leads to larger errors in the solution, but results in important computational 
savings. It can be used in problems where computational cost is more important than 
achieving a very high accuracy. In any case, one must note that the accuracy of the 
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algorithm is typically acceptable without pivoting, so in many cases no pivoting will be 
necessary. 

In (33) we can see that no subdiagonal coefficients with I > J) are needed 
to obtain x from c. In (30), we can see that only £jk are necessary in order to obtain 
Cj, thus making it unnecessary to know ^ LK for L < I. Therefore, we can get rid of 
them once cj is known. Since we calculate cj immediately after calculating all £ik, 
we can overwrite Ci+i,k on the memory position of £jk- If we do so, about one third 
of the memory is saved, since less coefficients must be stored, however, according to 
some preliminary tests, this option is also 20% slower than the simpler one in which all 
coefficients are stored independently. 

It is also worth remarking at this point that the present state of the algorithm is not 
yet completely optimized at the low level and, therefore, it cannot be directly compared 
to the thoroughly optimized routines included in commonly used scientific libraries such 
as LAPACK [34]. This further optimization will be pursued in future works. 

5. Parallelization 

There exist many works in the literature aiming at parallelizing the calculations needed 
to solve a banded system [1, 10,39-50]. The decision about which one to choose, and, 
in particular, which one to apply to the algorithms presented in this work depends, of 
course, on the architecture of the machine in which the calculations are going to be 
performed. The choice is additionally complicated by the fact that, normally, only the 
number of floating point operations required by each scheme is reported in the articles. 
However, the number of floating point operations is known to be a poor measure of the 
real wall-clock performance of computer algorithms and, especially, parallel ones, for a 
number of reasons: 

• Not all the floating point operations require the same time. For example, in 
currently common architectures, a quotient takes 4 times as many cycles as an 
addition or a product. 

• A floating point operation usually requires access to several positions of memory. 
Each access is much slower than the floating point operation itself [51]. Moreover, 
the number of memory accesses does not need to be proportional to the number of 
floating point operations. 

• Transferring information among nodes in a cluster is commonly much slower than 
accessing a memory position or performing a floating point operation [51]. 

Despite these unavoidable complexities and the fact that rigorous tests should be 
made in any particular architecture, two parallelizing schemes seem well suited for 
the method presented in this work: the one in ref. [41] for shared-memory machines 
and the one in ref. [10] for distributed-memory machines. The former is faster if 
the communication time among nodes tends to zero, whereas the latter tackles the 
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communication time problem by significantly reducing the number of messages that 
need to be passed. 

6. Differences with Gaussian elimination 

In order to asess the performance of the method derived in the previous sections, we will 
present results of numerical tests of real systems. In sec. 7, we compare the absolute 
accuracy and numerical efficiency of our New Algorithm with those of the banded solver 
described in the well-known book Numerical Recipes in C [35]. However, before doing 
that, we can make some general remarks about the validity of the new method from the 
numerical point of view. 

At this point, it is worth remarking that the present state of our New Algorithm for 
banded systems is not yet completely optimized at the low level. Therefore, it cannot 
be directly compared to the thoroughly optimized routines included in commonly used 
scientific libraries such as LAPACK [34]. This further optimization will be pursued in 
future works. At the current state, it is natural to compare our algorithm to an explicit, 
high-level, not optimized routine, such as the ones in Numerical Recipes in C [35], and 
the results here should be interpreted as a hint of the final performance when all levels 
of optimization are tackled. 

Our New Algorithm (NA) is based on equations (26a, 266, 26c), (30) and (33). 
The source code of its different versions can be found in the supplementary material. 
The solver of [35] (NRC) belongs to a popular family of algorithms (see, for example, 
[36]) which work by calculating the £ coefficients involved in the Gaussian elimination 
procedure in different iterations. Both for NA and for NRC, the £ coefficients required 
for the resolution result from the summation of several terms. For a given set of £'s, 
Gaussian elimination-based methods first obtain the first term of the summation of 
every £ then the corresponding second terms of the summations, and so on. In contrast, 
our method first obtains the final value of a given £ by calculating all the terms in the 
corresponding summation; then once a given £/j is known, it computes and so 

on. 

Both the NRC Gaussian elimination method for banded systems and our New 
Algorithm perform the same number of operations (i.e., the same number of additions, 
the same number of products, etc.). However, their efficiencies are different, as is shown 
in sec. 7.1. We believe this is due to the way the computers which run the algorithms 
access the memory positions which store the variables involved in the problem. The time 
that modern computers take to perform a floating point operation with two variables 
can be much smaller than the time required to access the memory positions of these two 
variables. In a modern computer, an addition or product of real numbers can take of the 
order of 1CT 9 -10~ 8 s. If the variables involved are stored in the cache memory, to access 
them can take also 10~ 9 -1CT 8 s. If they are stored in the main memory, the access can 
take the order of 1CT 7 s [51,52]. The speed to access one position of memory is given 
not only by the level of memory (cache, main memory, disk, etc.) where it lies, but 
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also by the proximity of the position of memory which was immediately read previously. 
Therefore, it is expected that two different algorithms will require different execution 
times if they access the computer memory in different ways, even if they perform the 
same operations with the same variables. 

In the NRC Gaussian elimination procedure, a given number of floating point 
variables is added to each entry Ajj. The same number of floating point variables 
has to be added to A u to calculate £jj in the New Algorithm. However, the order it 
is done is different in both cases. In Gaussian elimination, one row of A times a given 
number is added to another row of A, and this is repeated many times. For example, 
for erasing the subdiagonal entries of the first column of A, the first row (times the 
appropriate numbers) is added to rows 2 to mi + 1. Then, to erase the (new) second 
column, the (new) second row is added to rows 3 to mi + 2, and so on. Let us consider 
Ai,3, without a loss of generality. A given number is added to when the first row 
is added to its lower rows; after some steps, another number is added to A 43 , when 
the second row is added to its lower rows. Again after some steps, another number is 
added to A^ 3 (when the third row is added to its lower rows). In this procedure, the 
memory positions that are accessed move away from the position of A 4 3 , and then they 
come back to it, which can be suboptimal. However, in our New Algorithm, numbers 
are added to a memory position (say At j3 ) only once (see 25c), making the sweeping 
of memory positions more efficient. Some simple tests seem to support this hypothesis. 
We define the following loops: 

• Loop 1: (Analogous to loop of NRC-Gaussian elimination) 

for (K = 0, K < 1000000, K + +) do 
for (1 = 0, I <m h I + +) do 

for ( J = 0, J < m u - 1, J + +) do 

A[ mi ][J]+ = A[I][J] 
end for 
end for 
end for 

• Loop 2: (Analogous to the loop of the New Algorithm) 

for (K = 0, K < 1000000, K + +) do 

for (1 = 0, I < m u - 1, I + +) do 
A[mi][I}+ = A[0][I] + ... + A[m t - 2] [7] 

end for 
end for 

The way in which Loop 1 sweeps the memory positions is analogous to that of the 
Gaussian elimination method (NRC), because it adds the different numbers to a given 
memory position (A[m;][J]) in different iterations within the intermediate loop. The 
way Loop 2 sweeps the memory positions is analogous to that of the New Algorithm 
(NA), because it adds the different numbers to a given memory position (A[m;][/]) just 
at a stretch. If we compare the times required by their executions (using m u = 10), we 
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Table 1. Comparison of the execution times of simple tests. The last column 
corresponds to data taken from sec. 7. 



mi 


^-Loopl 


^Loop2 


^Loopl / ^Loop2 


tNRC/tNA 


3 


0.379 


0.169 


2.243 


1.145 


10 


1.255 


0.509 


2.466 


1.780 


30 


3.788 


1.473 


2.563 


2.360 



This simple test gives us a clue on how the different ways to sweep memory positions 
can result in rather different execution times. The comparison between £loopiAloo P 2 an< ^ 
tNRc/tNA associated with the actual NRC and NA algorithms (without pivoting, with 
N = 10 6 and m u = toj, see sec. 7) is merely qualitative. This is because the way the 
memory access takes place in Loop 1 is not exactly the same as the way the memory 
access takes place in NRC, nor is it the same for Loop 2 and NA (and also the mjs are 
different). The reason why the relative performance of NA vs. NRC decreases with m 
for m > 35 approximately can be due to the fact that in our implementation, NA uses 
matrices which are larger than those of NRC. 

7. Numerical tests 

In this section we quantitatively compare our New Algorithm with the banded solver 
based on Gaussian elimination of [35]. We do so by comparing the accuracy and 
efficiency of both algorithms for solving given systems. For the sake of generality, in the 
first part of this section (7.1) we use random banded systems as inputs for our tests. In 
the second part (7.2), we use them (plus a modified version of NA) to solve a physical 
problem, the calculation of the Lagrange multipliers in proteins. 

7.1. Performance for generic random banded systems 

In the test systems for our comparisons we imposed m u = mi = m for the sake of 
simplicity. We took n = 10 3 , 10 4 and 10 5 and m = 3, 10, 30, 100 and 300 for all of them 
in our tests. In addition to this, we took n = 10 6 with m — 3, 10 and 30. For each given 
pair of values of n and m, we generated a set of 1000 random n x n banded matrices 
whose entries are null, except the diagonal ones and their first m neighbours on the 
right and on the left. The value of these entries is a random number between 500 and 
—500 with 6 figures. The components of the independent term (the vector b in (4)) are 
random numbers between and 1000, also with 6 figures. We tested both algorithms 
(from [35] and our NA) with and without pivoting. We used PowerPC 970FX 2.2 GHz 
machines, and no specific optimization flags were given to the compiler apart from the 
basic one (g++ -o solver solver. cpp). Every point in our performance plots corresponds 
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to the mean of 1000 tests and, in each point, we used the same random system as the 
input for both algorithms. 

We measured the efficiency of a given algorithm by using an average of its execution 
times for given banded systems. In the measurement of such execution times, we 
considered only the computation time; i.e., the measured execution times correspond 
only to the solution of the banded systems, and not to other parts of the code such 
as the generation of random matrices and vectors, the initialization of variables or the 
checking and the storage of the results. The measurements of the execution times start 
immediately after the initialization, and the clock is stopped immediately after the 
unknowns x are calculated. The concrete information on how the measurement of times 
was implemented can be can be found in the source code of the programs used for our 
tests, which are included in the supplementary material. As is shown there, standard C 
libraries were used to measure the times. 



The accuracy of the algorithms was determined by measuring the error of the 
solutions they provide (x). We quantified the error with the following formula, which 
corresponds to the normalized deviation of Ax from b: 




Figure 1. Properties of the New Algorithm introduced in this work, with 
pivoting, as a function of the size of the matrix n and the width of the band 
m in random banded test systems, a) Its accuracy, as measured by the error 
defined in equation (39) . b) Its numerical efficiency, measured by the execution 
time. 



In the first diagrams of this section (figs. 1, 2, 3 and 4) we present the absolute and 
relative accuracy and efficiency of the NA and NRC algorithms for the cases with and 
without pivoting. In these figures, the yellow spheres represent the calculated points, 
which correspond to the average of 1000 tests with different input random matrices 
and vectors. For the sake of visual contort, interpolating surfaces have been produced 
with cubic splines and the x and y axes (labeled n and m) are in logarithmic scale. In 
figures 2, 4 we compare quantities between the two algorithms; a blue plane at z = 1 is 
included. Above this plane, NA is more competitive than NRC; below this plane, the 
converse is true. 
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In figure la, we can see that our algorithm with pivoting has very good accuracy, 
with the error satisfying log(Error) oc log(m). The error is proportional to a power 
of m with a small exponent (~ 1.4). In the same figure we notice that this error is 
approximately independent of n. The execution time in the tested region (see figure lb) 
is proportional to n and also approximately proportional to m 17 , not to m 2 as one 
would expect from the number of floating point operations (oc nm 2 ). This suggests that 
memory access is an important time-consuming factor, in addition to floating point 
operations. 



a) b) 




Figure 2. Comparison between the properties of the New Algorithm introduced 
in this work (NA) and the one in ref. [35] (NRC), both with pivoting, as a 
function of the size of the matrix n and the width of the band m in random 
banded test systems, a) Relative accuracy, as measured by the ratio of the 
errors defined in equation (39). b) Relative numerical efficiency, measured by 
the ratio of the execution times. 



In figure 2a, we can see that, if pivoting is performed, our New Algorithm is always 
more accurate than NRC, except for a narrow range of m between 1 and 4. The typical 
increase in accuracy is around a 5%, reaches almost 15% for some values of n and m. 
In figure 2b, we can see that, if pivoting is performed, the New Algorithm is also faster 
than NRC for most of the studied values of n and m, with typical speedups of around 
40% and the largest ones of almost 80%. 

If pivoting is not performed, the accuracy decreases typically by two or three 
orders of magnitude (but errors still remain very low, usually around 10~ 10 ).In the 
non-pivoting case, we also see that a few of the calculations (around 1 in 500) present 
errors significantly larger than the average. This probably suggests that the random 
procedure has produced a matrix that is close to singular with respect to the hypotheses 
introduced in sec. 2. In fig. 3, we show a typical example of the distribution of errors 
for the non-pivoting banded solvers NA and NRC in figure 3. The data corresponds to 
the errors of 1000 random input matrices with n = 10 5 and m = 10. In such a case, the 
average of the error is less representative. In this example test, the highest error in NRC 
is 1.57 • 10~ 9 , and in NA is 2.53 • 10~ 9 , although these numbers are probably anecdotal. 
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Figure 3. Histogram of the errors made by the algorithms NA and NRC for 
solving random banded systems without pivoting. The data corresponds to 
1000 random inputs with n = 10 5 and m = 10. 



One must also note that 99% of the errors are O(10~ 10 ) or smaller. A comparison of 
the red and green bars in the histogram suggests that there are no big differences in the 
errors of both algorithms (NA and NRC) without pivoting. 

Despite these problems in dealing with almost singular matrices, algorithms without 
pivoting have an important advantage regarding computational cost, and they can be 
useful for problems in which the matrices are a priori known to be well behaved. These 
computational savings are noticed if we compare figs, lb, and 4a. In figure 4b, we 
can additionally see that the New Algorithm introduced in this work is always faster 
than NRC for the explored values of n and m if no pivoting is performed; the increase 
in efficiency reaching almost to a factor of 3 for some values of n and m, and being 
typically around a factor of 2. 

7.2. Analytical calculation of Lagrange multipliers in a protein 

In Molecular Dynamics simulations, it is a common practice to constrain some of the 
internal degrees of freedom of the involved systems. This enables an increase in the 
simulation time step, makes the simulation more efficient, and is expected not to severely 
distort the value of the observable quantities calculated in the simulation [53,54]. The 
bond lengths of a molecule can be constrained by including algebraic restrictions such 
as the following one: 



in the system of classical equations of motion of the atoms. In this expression, the 
positions of atoms in a molecule formed by N a atoms are given by x a , Xp, with 
a,/3 = l,...,N a . The parameter a a< p is the length of the bond which links atoms 
a and f3. 



x a - xpf - (a Qj(3 ) 2 = 



(40) 
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Figure 4. a) Numerical complexity of the New Algorithm introduced in this 
work, without pivoting, when solving random banded test systems. Execution 
time is shown as a function of the size of the matrix n and the width of the 
band, b) Comparison between the numerical complexity of NA and NRC. 



The imposition of holonomic constraints such as (40) under the assumption of 
the D'Alembert principle makes the so-called constraint forces appear. These forces 
are proportional to their associated Lagrange multipliers, which have to be calculated 
in order to evaluate the dynamics of the system. Proteins, nucleic acids and other 
biological molecules have an essentially linear topology, which makes it possible to 
calculate the Lagrange multipliers associated to their constrained internal degrees of 
freedom by solving banded systems. More explanations on how to impose constraints 
on molecules and on how to calculate the Lagrange multipliers in biomolecules can be 
found in [22]. 

In this section, we compare the efficiencies and accuracies of three methods to solve 
the banded systems associated with the calculation of Lagrange multipliers of a family 
of relevant biological molecules (polyalanines). The three methods we compare are: 

• The Gaussian elimination algorithm for banded systems presented in [35] (NRC) 

• The New Algorithm (NA) presented here, based on equations (26a, 266, 26c) 

• A modified version of the New Algorithm presented here, which uses the methods 
discussed in sec. 3 and takes advantage in the symmetry of the system (i.e., it uses 
equation (27) instead of (26c)) 

All three methods are implemented without pivoting. The accuracies and efficiencies of 
the first two ones were compared in sec. 7.1 for banded matrices with random entries. 

In our tests, we calculated the Lagrange multipliers of a-helix shaped polyalanine 
chains (as the one displayed in fig. 5) with different numbers of residues (R). See [22] 
for further information on the way the systems of equations to solve were generated. In 
our tests, we measured the error as calculated with (39), as well as the execution time 
of the algorithms. We ran them in a MacBook6,l with a 2.26 GHz Intel Core 2 Duo 
processor. 

The results are displayed in figures 6 and 7. For all the polypeptide lengths 
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Figure 5. Polyalanine chain of 40 residues in a a-helix shape. White spheres 
indicate H atoms, dark spheres indicate C atoms, blue spheres indicate N 
atoms and red spheres indicate O atoms. The covalent bonds appear as rods 
connecting them. Diagram made with Avogadro [55]. 
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Figure 6. Comparison of the execution times (t) of different algorithms to solve 
banded systems in the calculation of the Lagrange multipliers in a polyalanine 
chain of R residues. Vertical crosses: NRC (Gaussian elimination); Diagonal 
crosses: NA; Squares: modified New Algorithm. 

represented in fig. 6, the execution time of the Gaussian elimination algorithm (NRC) is 
about 1.57 times the execution time of the New Algorithm (1.57 ± 0.01). The modified 
New Algorithm (squares in figures 6 and 7) is about 2.70 times faster than the NRC 
algorithm. These results were the expected results for the used values of n, m u and mi 
(m u = mi = m = 6, n = 10R + 2), according to the tendencies observed in the previous 
section. Higher values of m are expected to result in better relative efficiency of the 
New Algorithm (see sec. 7.1). A situation that we can meet, for example, if not only 
bond lengths, but also bond angles, are constrained, and if the branches of the molecule 
are longer (for example, the side chains of the arginine residue are longer than the side 
chains of the alanine residue). 

The errors made by the three tested algorithms are displayed in fig. 7. As expected 
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Figure 7. Comparison of the errors made by different algorithms in the 
solution of banded systems for the calculation of the Lagrange multipliers in a 
polyalanine chain of R residues. Vertical crosses: NRC (Gaussian elimination); 
diagonal crosses: NA; squares: modified New Algorithm. 



for the case without pivoting (see sec. 7.1), the errors of the NRC and NA algorithms 
are similar, and both are very small (similar to the errors arisen from the finite machine 
precision). The error of the modified version of the New Algorithm is typically less 
that half of the error of the other two methods. This can be due to the fact that the 
modified version uses equation (27) instead of (26c). Therefore, fewer numbers (about 
half of them) are present in the calculation, and hence fewer potential sources of error 
are present. 



We conclude that, for the systems tested in this section, the new algorithm 
introduced in this work is competitive both in accuracy and in computational efficiency 
when compared with a standard method for inverting banded matrices. This holds true 
both with and without pivoting. We stress we are comparing two algorithms which are 
not yet thoroughly optimized (as LAPACK is). 

8. Concluding remarks 

In this paper, we have introduced a new linearly scaling method to invert the banded 
matrices that so often appear in problems of Computational Physics, Chemistry and 
other disciplines. We have proven that this new algorithm is capable of being more 
accurate than standard methods based on Gaussian elimination at a lower computational 
cost, which opens the door to its use in many practical problems, such as the ones 
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described in the introduction. 

Moreover, we have produced the analytical expressions that allow us to directly 
obtain, in a recursive manner, the solution to the associated linear system in terms of 
the entries of the original matrix. To have these explicit formulae (which have also been 
presented for the calculation of the full inverse matrix in the Appendix) at our disposal 
not only simplifies the task of coding the needed computer algorithms, but it may also 
be useful to facilitate analytical developments in the problems in which banded matrices 
appear. 

In addition, we have checked its performance for general trial systems, and proven 
its usefulness for real physical problems (calculations on dynamics of proteins). 
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Appendix A. Inverse of a banded matrix 

In the previous sections, we proved that the banded linear system of n equations with 
n unknowns in (1) can be solved in order n operations. Sometimes, we are interested 
in obtaining the inverse matrix A^ 1 itself. We can do this in order n 2 operations using 
the same kind of ideas discussed in the main body of the article. It should be stressed 
that the explicit inverse of an arbitrary banded matrix usually cannot be obtained in 
0(n) floating point operations, since the inverse of a banded matrix has n 2 entries and 
it is not, in general, a banded matrix itself (an exception to this is a block diagonal 
matrix). In order to obtain an efficient way to invert A, we will derive some recursive 
relations between the rows of P (and Q). To this end, we will first calculate the explicit 
expression of the entries of these matrices. 

Using eqs. (5a), (6), and (7), from which (8a, 8b, 8c) and (28a) follow, and after 
some straightforward but long calculations, one can show that the aforementioned entries 
satisfy 

Pi j = Oj H UdorKJ, (1.1a) 

Pii=dn, (1-16) 
Pu = for / > J , (1.1c) 
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and 

Qu= II 6^for/>J, (1.2a) 

<xcf,j, mt (K > L)ec 

Qn = 1 , (1.26) 

Q LJ = for 7 < J . (1.2c) 

We call the summations appearing in the first line of each of these groups of expressions 
jump summations. There are two jump summations here, one from I to J with increasing 
indices (t) and m u neighbours, and a jump summation from I to J with decreasing 
indices (\) and mi neighbours, respectively. The jump summation provides us with an 
explicit expression for all entries in A' 1 , without the need to recursively refer to other 
entries. This can be useful in order to parallelize its calculation. 

As it can be seen in the expressions, that each product in the sums contains a 
number of coefficients £kl- The pairs of indices (K,L) which are included in a given 
product are taken from a set c. In turn, each term of the sum corresponds to a different 
set of pairs of indices c drawn from a set of sets of pairs of indices Cj Jrriu (in the 
case of Pjj) or Cj Jmi (in the case of Qu). Therefore, the only detail that remains to 
understand these 'jump summations' is to specify which are the elements of these latter 
sets. 

A given element c of either Cj Jniu or C\ Jmi can be expressed as 

c = {(K u L ± ), (K 2 , L 2 ), . . . , (K s , L s )} , (1.3) 
in such a way that C] Jm comprises all possible c's that comply with a number of rules: 

• K 1 — I, and L s = J. 

• K r < L r , for r = 1, . . . , S. 

• K r+ i = L r , for r = 1, . . . , S. 

• L r — K r < m u , for r = 1, . . . , S. 
Let us see an example: 

^ II ^ KL = ^3,4^4,5^5,6 + ^3,4^4,6 + ^3,5^5,6 • (1-4) 

The rules to determine the elements of Cj j are analogous to the ones above but 
they take into account that the indices decrease: 

• K\ — I, and Ls = J- 

• K r > L r , for r = 1, . . . , S. 

• K r+ i = L r , for r — 1, . . . , S. 

• K r — L r < mi, for r = 1, . . . , S. 
An example would be: 

n ^ kl = ^5,4^4,3^3,26,1 + 6,36,26,1 + 6,46,26,1 

cec£ 1]3 (K,L)ec 

+ 6,46,36,1 + 6,36,1 + 6,26,1 + 6,46,1 • (1-5) 
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If we first focus on P, it is easy to see that, according to the properties of the jump 
summation, we have 

e n z*l=zu + i e n ^ 
+ci,i + 2 e n ^ 



IKL 



+ -..+Cl,I + m u E II ^L- (1-6) 



^^/+m u , J,m u 



If we insert a multiplicative factor £jj at both sides and use (1.1a), this equation 
becomes equation (32a) obtained in sec. 2: 

min{I+m u ,n} 

Pu= E ^ kPk -J for/<J. (1.7) 

K=I+1 

An analogous expression for Q can be obtained in a similar way: 

min{ J+mi ,n} 

Qu= E for/>J. (1.8) 

L=J+1 

Now, if we define «i := min{m u , J — /}, and ji 2 '■= min{m;,J — J}, since A^ 1 = PQ 
(see (4)), we have that 



{A- V ) u = (PQ)jj = PikQkj = E P ikQkj 

K=l K=J 
n / I+ni \ I+/J.1 

= E E ^ = E fa I E 



Jf=J \L=7+1 / L=J+1 \/sT=J 

= £ 6l(^ _1 )lj for / < J , (1.9) 

L=I+1 

which is a recursive relationship for the superdiagonal entries of A^ 1 . 
Performing similar computations, we have 

i+m J+H2 
(A- 1 ) II =Z II + E Zil(A- 1 ) L j = Z ii + J2 tLj{A~ l ) IL , (1.10) 

L=I+1 L=J+1 

(A- 1 ) IJ = ZUA- 1 )^ for/>J. (1.11) 

L=J+1 

Using the last three equations, we can easily construct an algorithm to compute A^ 1 
in 0(n 2 ) floating point operations. This algorithm would first calculate (A _1 )„ n = £ nn . 
Then it would use (1.9) to obtain, in this order, (A _1 ) n _ ljn , (A~ 1 ) n _2, n , (A~ l )i n . 
These are the superdiagonal (I < J) entries of the n-th column. Then, it would use 
(1.11) to obtain, in this order, (A _1 ) n n „ 1 , (A _1 ) nin _ 2 , . . ., (A _1 ) nl , i.e., the subdiagonal 
(7 > J) entries of the n-th row. Once the n row and column of A" 1 are known, 
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(A _1 )„_i jn _i can be obtained with (1.10). Then (1.9) and (1.11) can be used to obtain 
the entries of this (n — 1) column and row, respectively. When calculating the entries of a 
column J, i.e., £kj with K < J, is always obtained before When calculating 

the entries of a row J, i.e., with I > K, £jk is always obtained before £i,k-i- This 
procedure can be repeated for all rows and columns of A^ 1 , and the calculation of the 
K-th row and column can be performed in parallel. 
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